This project involves cleaning, exploring, and modeling crime data from Portland, Oregon (2015–2023). After addressing missing data and reducing noise, we uncovered key trends in offense types, report times, and their relationship to neighborhood, day, and time. We then developed an XGBoost machine learning model that predicts daily total offense counts by neighborhood, currently achieving an R² score of 0.62. This model could support better resource planning and crime prevention strategies across the city.
Address: Address of reported incident at the 100 block level (e.g.: 1111 SW 2nd Ave would be 1100 Block SW 2nd Ave).
Case Number: The case year and number for the reported incident (YY-######).
Crime Against: Crime against category (Person, Property, or Society).
Neighborhood: Neighborhood where the incident occurred. If the neighborhood name is missing, the incident occurred outside the boundaries of the Portland neighborhoods or at a location that could not be assigned to a specific address in the system (e.g., Portland, near Washington Park, on the streetcar).
Occur Date: Date the incident occurred. The exact occur date is sometimes unknown. In most situations, the first possible date the crime could have occurred is used as the occur date. (For example, victims return home from a week-long vacation to find their home burglarized. The burglary could have occurred at any point during the week. The first date of their vacation would be listed as the occur date.)
Occur Time: Time the incident occurred. The exact occur time is sometimes unknown. In most situations, the first possible time the crime could have occurred is used as the occur time. The time is reported in the 24-hour clock format, with the first two digits representing hour (ranges from 00 to 23) and the second two digits representing minutes (ranges from 00 to 59).
Offense Category: Category of offense (for example, Assault Offenses).
Offense Type: Type of offense (for example, Aggravated Assault). Note: The statistic for Homicide Offenses has been updated in the Group A Crimes report to align with the 2019 FBI NIBRS definitions. The statistic for Homicide Offenses includes (09A) Murder & Non-negligent Manslaughter and (09B) Negligent Manslaughter. As of January 1, 2019, the FBI expanded the definition of negligent manslaughter to include traffic fatalities that result in an arrest for driving under the influence, distracted driving, or reckless driving. The change in definition affects the 2019 homicide offenses statistic and the comparability of 2019 homicide statistics to prior years.
Open Data Lat/Lon: Generalized latitude/longitude of the reported incident. Offenses that occurred at a specific address are mapped to the block's midpoint. Offenses that occurred at an intersection are mapped to the intersection centroid.
Open Data X/Y: Generalized XY point of the reported incident. Offenses that occurred at a specific address are mapped to the block's midpoint. Offenses that occurred at an intersection are mapped to the intersection centroid. To protect the identity of victims and address other privacy concerns, the points of certain case types are not released. XY points use the Oregon State Plane North (3601), NAD83 HARN, US International Feet coordinate system.
Offense Count: Number of offenses per incident. Offenses (i.e. this field) are summed for counting purposes.
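As a small illustration of the Occur Date and Occur Time fields described above, the sketch below combines a date with a four-digit 24-hour time into a single timestamp. The helper name parse_occur_datetime, the YYYY-MM-DD date format, and the sample values are assumptions made for this example; it also assumes times may arrive with leading zeros dropped (for example 45 for 00:45), which may not match the raw export.

```python
from datetime import datetime


def parse_occur_datetime(occur_date: str, occur_time) -> datetime:
    """Combine a YYYY-MM-DD date and an HHMM time (hypothetical helper)."""
    # Pad to four digits so that 45 becomes "0045", i.e. 00:45.
    hhmm = str(occur_time).zfill(4)
    hour, minute = int(hhmm[:2]), int(hhmm[2:])
    # Reject values outside the documented 00-23 / 00-59 ranges.
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError(f"invalid occur time: {occur_time!r}")
    return datetime.strptime(occur_date, "%Y-%m-%d").replace(hour=hour, minute=minute)


print(parse_occur_datetime("2021-07-04", 45))      # 2021-07-04 00:45:00
print(parse_occur_datetime("2021-07-04", "2315"))  # 2021-07-04 23:15:00
```

Validating the hour and minute ranges up front makes malformed times fail loudly instead of silently producing a wrong timestamp.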
Thinking forward to the analysis that I want to perform with this data, I need to understand what I am looking for when it comes to cleaning. I know that I want to focus my analysis on the temporal crime trends across the various neighborhoods of Portland. Based on this understanding, I get a better sense of what aspects of the data need to be cleaned.
Address column seems to be redundant as most entries are just a general location
Drop Address column
Neighborhood averages can be used to find lat/lon
Drop rows with missing Neighborhood and OpenDataLat
Replace all rows with neighborhood but missing Lat/Lon data with average Lat/Lon of their neighborhood
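The last two steps of this plan can be sketched on a toy table. The coordinates below are hypothetical values chosen for illustration, not real incidents; the column names follow the ones used in the cleaning code.

```python
import pandas as pd

# Toy data: hypothetical coordinates, not real incidents.
df = pd.DataFrame({
    "Neighborhood": ["Downtown", "Downtown", "Downtown", "Hazelwood", None],
    "OpenDataLat": [45.50, 45.52, None, 45.53, None],
    "OpenDataLon": [-122.68, -122.66, None, -122.55, None],
})

# Drop rows where both the neighborhood and the latitude are missing.
df = df.dropna(subset=["Neighborhood", "OpenDataLat"], how="all")

# Per-row neighborhood mean, aligned to the index of df.
means = df.groupby("Neighborhood")[["OpenDataLat", "OpenDataLon"]].transform("mean")

# Fill only the missing coordinates with the neighborhood mean.
df[["OpenDataLat", "OpenDataLon"]] = df[["OpenDataLat", "OpenDataLon"]].fillna(means)
print(df)
```

Using transform('mean') rather than a plain aggregation keeps one mean per original row, so fillna can align on the index without a merge.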
Data Cleaning
Cleaning
# Calculate average Lat/Lon for each neighborhood (one value per original row)
neighborhood_means = pcrime_combined.groupby('Neighborhood')[['OpenDataLat', 'OpenDataLon']].transform('mean')

# Clean the data
pcrime_cleaned = (
    pcrime_combined
    .drop(columns=['Address', 'OpenDataX', 'OpenDataY'])  # Drop Address and X/Y
    .dropna(subset=['OpenDataLat', 'Neighborhood'], how='all')  # Drop rows missing both Lat and Neighborhood
    .assign(
        OccurDate=pd.to_datetime(pcrime_combined['OccurDate']),  # Convert dates to datetime
        week=lambda x: x.OccurDate.dt.isocalendar().week,
        year=lambda x: x.OccurDate.dt.year,
        month=lambda x: x.OccurDate.dt.month,
        dayofmonth=lambda x: x.OccurDate.dt.day,
        ReportDate=pd.to_datetime(pcrime_combined['ReportDate']),
        ReportDiff=lambda x: (x['ReportDate'] - x['OccurDate']).dt.days,  # Calculate time to report
        # Fill missing Lat/Lon with average Lat/Lon of given neighborhood
        OpenDataLat=lambda x: x['OpenDataLat'].fillna(neighborhood_means['OpenDataLat']),
        OpenDataLon=lambda x: x['OpenDataLon'].fillna(neighborhood_means['OpenDataLon']),
        OccurTime=lambda x: x['OccurTime'].astype(str).str.zfill(4),  # Ensure time is in HHMM format
        # Combine date and formatted time into datetime
        OccurDateTime=lambda x: pd.to_datetime(
            x['OccurDate'].dt.strftime('%Y-%m-%d') + ' ' + x['OccurTime'].str[:2] + ':' + x['OccurTime'].str[2:]),
        OccurHour=lambda x: x.OccurDateTime.dt.hour,
    )
    .loc[lambda x: x['OccurDateTime'].dt.year.between(2015, 2023)]  # Filter rows with years within 2015–2023
)
pcrime_cleaned
We have now cleaned the data into a usable state for analysis. We went from missing values in many columns to only 7,881 rows with a missing value, all in the neighborhood column.
Note: The rows with a missing neighborhood could be cleaned further using a reverse geocoding API; however, that is beyond the scope of this project.
This chart shows that crime levels remained fairly consistent in 2019. In 2020, we see a noticeable drop when the country went into lockdown, followed by a sharp increase as restrictions eased in the summer. Then, in 2021, Portland experienced a significant surge in crime, which remained relatively high until 2023, when it began to stabilize.
Other Temporal Counts
# Month Count
month_count = pcrime_cleaned.groupby(pcrime_cleaned['OccurDateTime'].dt.month)['OffenseCount'].sum().reset_index()
month_count.rename(columns={month_count.columns[0]: 'Month'}, inplace=True)
month_count_fig = ggplot(month_count, aes(x="Month", y="OffenseCount")) + \
    geom_line(color='#2e6f40', size=1.5) + \
    geom_point(color='#2e6f40', size=3) + \
    labs(title='Offense Counts by Month', x='Month', y='Offense Count') + \
    scale_x_continuous(breaks=list(range(1, 13))) + \
    theme_minimal() + \
    theme(
        plot_title=element_text(size=16, face='bold'),
        axis_title_x=element_text(size=12, face='bold'),
        axis_title_y=element_text(size=12, face='bold'),
        axis_text_x=element_text(size=10),
        axis_text_y=element_text(size=10),
        panel_grid_minor=element_blank()
    )

# Weekday Count
weekday_count = pcrime_cleaned.groupby(pcrime_cleaned['OccurDateTime'].dt.weekday)['OffenseCount'].sum().reset_index()
weekday_count.rename(columns={weekday_count.columns[0]: 'Weekday'}, inplace=True)
weekday_count_fig = ggplot(weekday_count, aes(x="Weekday", y="OffenseCount")) + \
    geom_line(color='#2e6f40', size=1.5) + \
    geom_point(color='#2e6f40', size=3) + \
    labs(title='Offense Counts by Weekday', x='Weekday', y='Offense Count') + \
    theme_minimal() + \
    theme(
        plot_title=element_text(size=16, face='bold'),
        axis_title_x=element_text(size=12, face='bold'),
        axis_title_y=element_text(size=12, face='bold'),
        axis_text_x=element_text(size=10),
        axis_text_y=element_text(size=10),
        panel_grid_minor=element_blank()
    )

# Hour Count
hour_count = pcrime_cleaned.groupby(pcrime_cleaned['OccurDateTime'].dt.hour)['OffenseCount'].sum().reset_index()
hour_count.rename(columns={hour_count.columns[0]: 'Hour'}, inplace=True)
hour_count_fig = ggplot(hour_count, aes(x="Hour", y="OffenseCount")) + \
    geom_line(color='#2e6f40', size=1.5) + \
    geom_point(color='#2e6f40', size=3) + \
    labs(title='Offense Counts by Hour', x='Hour', y='Offense Count') + \
    scale_x_continuous(breaks=list(range(0, 25))) + \
    theme_minimal() + \
    theme(
        plot_title=element_text(size=16, face='bold'),
        axis_title_x=element_text(size=12, face='bold'),
        axis_title_y=element_text(size=12, face='bold'),
        axis_text_x=element_text(size=10),
        axis_text_y=element_text(size=10),
        panel_grid_minor=element_blank()
    )

# Create a list of plots to display in a single row
temp_plot_list = [
    month_count_fig,
    weekday_count_fig,
    hour_count_fig
]

# Arrange the plots in a single row
temp_plots = gggrid(temp_plot_list, ncol=3) + ggsize(1200, 400)

# Show the combined plot
temp_plots
Observations
Crime patterns exhibit distinct temporal trends across months, weekdays, and hours. Monthly data shows that crime tends to slow down during the winter months and gradually rises through the summer and into the rest of the year, potentially influenced by seasonal factors such as weather and increased outdoor activity. Looking at weekly patterns, Friday stands out as the day with the highest number of reported offenses, which aligns with the start of the weekend when more people are out, creating more opportunities for crime. Additionally, crime follows a predictable daily cycle, with certain hours experiencing higher offense counts. These trends suggest that external factors like weather, social behavior, and law enforcement presence may play a role in crime fluctuations, warranting further analysis to uncover deeper insights.
It’s evident that crimes against property are much more common than the other categories. My initial thought is that property crimes may be more frequent because they’re often easier to commit, both physically and morally. Property crime doesn’t involve direct harm to individuals, which could make it feel less risky or less severe to potential offenders. The differences in report times are also interesting. All three Crime Against categories have a median report time of 1 day; however, crimes against persons show a wider spread of report times. This makes sense because many crimes against a person are sensitive situations, which can lead to delayed reporting.
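The comparison of report times by category can be sketched as a grouped summary. The column name CrimeAgainst is an assumption based on the naming style of the cleaned data, and the values below are hypothetical, chosen only so that both groups share a median while differing in spread.

```python
import pandas as pd

# Toy data with hypothetical values; ReportDiff is days between occurrence and report.
reports = pd.DataFrame({
    "CrimeAgainst": ["Person", "Person", "Person", "Property", "Property", "Property"],
    "ReportDiff": [0, 1, 30, 0, 1, 2],
})

# Median plus quartiles per category; the interquartile range measures spread.
summary = reports.groupby("CrimeAgainst")["ReportDiff"].agg(
    median="median",
    q25=lambda s: s.quantile(0.25),
    q75=lambda s: s.quantile(0.75),
)
summary["iqr"] = summary["q75"] - summary["q25"]
print(summary)
```

Reporting the interquartile range alongside the median shows how two groups can share a median while one has a much longer tail of late reports.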
Larceny stands out as the most common offense in Portland, reinforcing the broader trend that property crimes are significantly more prevalent than other crime categories. This may be attributed to the opportunistic nature of larceny—these offenses often require little planning and can happen quickly, unlike more serious crimes that demand time, effort, or emotional involvement. Additionally, neighborhood-level analysis shows that Downtown and Hazelwood experience disproportionately high numbers of reported offenses. While this highlights potential crime hotspots, further context—such as population density, neighborhood size, and visitor traffic—would provide a more accurate understanding of these patterns.
Machine Learning
The goal of this machine learning component is to predict the total number of offenses reported each day in each Portland neighborhood. Using historical crime data from 2015 to 2023, we trained an XGBoost regression model that captures both spatial and temporal patterns in crime activity. With an R² score of 0.62, the model shows promising potential to assist city officials in making data-informed decisions about where and when to allocate law enforcement resources to keep the community safe.
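The prediction target, daily total offenses per neighborhood, implies an aggregation step from incident-level rows. The sketch below shows one way that step might look; it is not necessarily how pcrime_ml_daily was built, and the sample rows are hypothetical. Column names follow those used elsewhere in this project.

```python
import pandas as pd

# Toy incident-level data with hypothetical values.
incidents = pd.DataFrame({
    "Neighborhood": ["Downtown", "Downtown", "Hazelwood", "Downtown"],
    "OccurDate": pd.to_datetime(["2023-01-06", "2023-01-06", "2023-01-06", "2023-01-07"]),
    "OffenseCount": [1, 2, 1, 1],
})

# Sum offenses per neighborhood and day, then derive calendar features.
daily = (
    incidents
    .groupby(["Neighborhood", "OccurDate"], as_index=False)["OffenseCount"]
    .sum()
    .rename(columns={"OffenseCount": "total_offenses"})
    .assign(
        year=lambda d: d["OccurDate"].dt.year,
        month=lambda d: d["OccurDate"].dt.month,
        weekday=lambda d: d["OccurDate"].dt.weekday,  # Monday=0 ... Sunday=6
    )
)
print(daily)
```

Note that this sketch only produces rows for neighborhood-days with at least one reported offense; whether the original training table also included zero-count days is not stated.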
XGBoost Model
from sklearn.metrics import mean_squared_error, r2_score, root_mean_squared_error  # root_mean_squared_error needs scikit-learn 1.4+
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Build the feature matrix and one-hot encode categorical columns
X = pcrime_ml_daily.drop(columns=['total_offenses', 'OccurDate', 'month', 'weekday'])
X = pd.get_dummies(X)
y = pcrime_ml_daily['total_offenses']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train an XGBoost regressor
xgb = XGBRegressor(
    max_depth=10,
    learning_rate=0.03,
    n_estimators=500,
    min_child_weight=25,
    subsample=.5,
    colsample_bytree=.6,
    random_state=42
)
xgb.fit(X_train, y_train)

# Make predictions
y_pred = xgb.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = root_mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f'RMSE: {rmse}')
print(f"R^2 Score: {r2}")
Mean Squared Error: 2.6946589946746826
RMSE: 1.641541600227356
R^2 Score: 0.6257736682891846
Feature Importance
import matplotlib.pyplot as plt
import seaborn as sns

# 'xgb' is the trained XGBRegressor model and 'X' is the feature dataframe
feature_importances = pd.DataFrame({'Feature': X.columns, 'Importance': xgb.feature_importances_})
feature_importances = feature_importances.sort_values(by='Importance', ascending=False).head(20)  # Top 20 features

# Create a horizontal bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importances)
plt.title('Top Feature Importances')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
Current Final Observations
Right now, the model has an R² score of 0.62, meaning it explains about 62% of the variance in daily offense counts on the held-out test set. This is a solid result, especially for real-world human behavior. Still, there’s room to improve. Adding more information about each neighborhood, such as population, income, education levels, and unemployment, could give the model a better understanding of what makes each area unique. This extra context could help the model make more accurate predictions.
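To make the interpretation of R² concrete, the sketch below computes it by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name r2_score_manual and the sample counts are hypothetical and used only for illustration; they are not the model's actual predictions.

```python
def r2_score_manual(y_true, y_pred):
    """Compute R^2 = 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: squared error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: squared error of always predicting the mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical daily counts and predictions, for illustration only.
y_true = [0, 2, 4, 6, 8]
y_pred = [1, 2, 3, 6, 8]
print(r2_score_manual(y_true, y_pred))  # 0.95
```

Because R² equals 1 minus MSE divided by the variance of the targets, the reported MSE of about 2.69 and R² of about 0.626 imply a test-set variance of roughly 2.69 / (1 - 0.626) ≈ 7.2 for the daily counts.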